AITopics | audio track

audio track

Brain Treebank: Large-scale intracranial recordings from naturalistic language stimuli

Neural Information Processing Systems, Dec-26-2025, 23:08:40 GMT

We present the Brain Treebank, a large-scale dataset of electrophysiological neural responses, recorded from intracranial probes while 10 subjects watched one or more Hollywood movies. Subjects watched on average 2.6 Hollywood movies, for an average viewing time of 4.3 hours, and a total of 43 hours. The audio track for each movie was transcribed with manual corrections. Word onsets were manually annotated on spectrograms of the audio track for each movie. Each transcript was automatically parsed and manually corrected into the universal dependencies (UD) formalism, assigning a part of speech to every word and a dependency parse to every sentence. In total, subjects heard over 38,000 sentences (223,000 words), while they had on average 168 electrodes implanted. This is the largest dataset of intracranial recordings featuring grounded naturalistic language, one of the largest English UD treebanks in general, and one of only a few UD treebanks aligned to multimodal features. We hope that this dataset serves as a bridge between linguistic concepts, perception, and their neural representations. To that end, we present an analysis of which electrodes are sensitive to language features while also mapping out a rough time course of language processing across these electrodes.

artificial intelligence, natural language, proceedings, (10 more...)

Neural Information Processing Systems

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.96)

Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance

Hayakawa, Akio, Ishii, Masato, Shibuya, Takashi, Mitsufuji, Yuki

arXiv.org Artificial Intelligence, Oct-8-2025

We propose a step-by-step video-to-audio (V2A) generation method for finer controllability over the generation process and more realistic audio synthesis. Inspired by traditional Foley workflows, our approach aims to comprehensively capture all sound events induced by a video through the incremental generation of missing sound events. To avoid the need for costly multi-reference video-audio datasets, each generation step is formulated as a negatively guided V2A process that discourages duplication of existing sounds. The guidance model is trained by finetuning a pre-trained V2A model on audio pairs from adjacent segments of the same video, allowing training with standard single-reference audiovisual datasets that are easily accessible. Objective and subjective evaluations demonstrate that our method enhances the separability of generated sounds at each step and improves the overall quality of the final composite audio, outperforming existing baselines.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2506.20995

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Workflow (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Human Feedback Driven Dynamic Speech Emotion Recognition

Fedorov, Ilya, Korobchenko, Dmitry

arXiv.org Artificial Intelligence, Aug-22-2025

This work proposes to explore a new area of dynamic speech emotion recognition. Unlike traditional methods, we assume that each audio track is associated with a sequence of emotions active at different moments in time. The study particularly focuses on the animation of emotional 3D avatars. We propose a multi-stage method that includes the training of a classical speech emotion recognition model, synthetic generation of emotional sequences, and further model improvement based on human feedback. Additionally, we introduce a novel approach to modeling emotional mixtures based on the Dirichlet distribution. The models are evaluated based on ground-truth emotions extracted from a dataset of 3D facial animations. We compare our models against the sliding window approach. Our experimental results show the effectiveness of the Dirichlet-based approach in modeling emotional mixtures. Incorporating human feedback further improves the model quality while providing a simplified annotation procedure.

artificial intelligence, emotion, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2508.1492

Country: Europe (0.28)

Genre: Research Report > New Finding (0.49)

Industry: Leisure & Entertainment (0.68)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science > Emotion (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Hallucination Level of Artificial Intelligence Whisperer: Case Speech Recognizing Pantterinousut Rap Song

Horppu, Ismo, Ayala, Frederick, Gulbenkoglu, Erlin

arXiv.org Artificial Intelligence, Jun-24-2025

All languages are peculiar. Some of them are considered more challenging to understand than others. The Finnish language is known to be a complex language. Also, when languages are used by artists, the pronunciation and meaning might be more tricky to understand. Therefore, we are putting AI to a fun, yet challenging trial: translating a Finnish rap song to text. We will compare the Faster Whisperer algorithm and YouTube's internal speech-to-text functionality. The reference truth will be Finnish rap lyrics, which the main author's little brother, Mc Timo, has written. Transcribing the lyrics will be challenging because the artist raps over synth music played by Syntikka Janne. The hallucination level and mishearing of AI speech-to-text extractions will be measured by comparing errors made against the original Finnish lyrics. The error function is informal but still works for our case.

machine learning, natural language, whisperer, (20 more...)

arXiv.org Artificial Intelligence

2506.16174

Genre: Research Report (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)
Information Technology (0.68)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Brain Treebank: Large-scale intracranial recordings from naturalistic language stimuli

Neural Information Processing Systems, May-27-2025, 12:43:58 GMT

artificial intelligence, brain treebank, natural language, (10 more...)

Neural Information Processing Systems

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.59)

LASER: Lip Landmark Assisted Speaker Detection for Robustness

Nguyen, Le Thien Phuc, Yu, Zhuoran, Lee, Yong Jae

arXiv.org Artificial Intelligence, Jan-21-2025

Active Speaker Detection (ASD) aims to identify speaking individuals in complex visual scenes. While humans can easily detect speech by matching lip movements to audio, current ASD models struggle to establish this correspondence, often misclassifying non-speaking instances when audio and lip movements are unsynchronized. To address this limitation, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER). Unlike models that rely solely on facial frames, LASER explicitly focuses on lip movements by integrating lip landmarks in training. Specifically, given a face track, LASER extracts frame-level visual features and the 2D coordinates of lip landmarks using a lightweight detector. These coordinates are encoded into dense feature maps, providing spatial and structural information on lip positions. Recognizing that landmark detectors may sometimes fail under challenging conditions (e.g., low resolution, occlusions, extreme angles), we incorporate an auxiliary consistency loss to align predictions from both lip-aware and face-only features, ensuring reliable performance even when lip data is absent. Extensive experiments across multiple datasets show that LASER outperforms state-of-the-art models, especially in scenarios with desynchronized audio and visuals, demonstrating robust performance in real-world video contexts. Code is available at https://github.com/plnguyen2908/LASER_ASD.

artificial intelligence, face track, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2501.11899

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

DAIRHuM: A Platform for Directly Aligning AI Representations with Human Musical Judgments applied to Carnatic Music

Ravikumar, Prashanth Thattai

arXiv.org Artificial Intelligence, Nov-22-2024

Quantifying and aligning music AI model representations with human behavior is an important challenge in the field of MIR. This paper presents a platform for exploring the Direct alignment between AI music model Representations and Human Musical judgments (DAIRHuM). It is designed to enable musicians and experimentalists to label similarities in a dataset of music recordings, and examine a pre-trained model's alignment with their labels using quantitative scores and visual plots. DAIRHuM is applied to analyze alignment between NSynth representations, and a rhythmic duet between two percussionists in a Carnatic quartet ensemble, an example of a genre where annotated data is scarce and assessing alignment is non-trivial. The results demonstrate significant findings on model alignment with human judgments of rhythmic harmony, while highlighting key differences in rhythm perception and music similarity judgments specific to Carnatic music. This work is among the first efforts to enable users to explore human-AI model alignment in Carnatic music and advance MIR research in Indian music while dealing with data scarcity and cultural specificity. The development of this platform provides greater accessibility to music AI tools for under-represented genres.

alignment, artificial intelligence, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2411.14907

Country:

Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Singapore (0.04)
Asia > India (0.04)

Genre: Research Report > New Finding (0.49)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Meta has created a way to watermark AI-generated speech

MIT Technology Review, Jun-18-2024, 16:49:38 GMT

However, there are some big caveats. Meta says it has no plans yet to apply the watermarks to AI-generated audio created using its tools. Audio watermarks are not yet widely adopted, and there is no single agreed industry standard for them. And watermarks for AI-generated content tend to be easy to tamper with, for example by removing or forging them. Fast detection, and the ability to pinpoint which elements of an audio file are AI-generated, will be critical to making the system useful, says Elsahar.

meta, watermark, watermark ai-generated speech, (6 more...)

MIT Technology Review

Country:

Europe > Austria > Vienna (0.18)
North America > United States > Illinois > Cook County > Chicago (0.06)

Technology: Information Technology > Artificial Intelligence (0.76)

ANIM-400K: A Large-Scale Dataset for Automated End-To-End Dubbing of Video

Cai, Kevin, Liu, Chonghua, Chan, David M.

arXiv.org Artificial Intelligence, Jan-10-2024

The Internet's wealth of content, with up to 60% published in English, contrasts starkly with the global population, where only 18.8% are English speakers and just 5.1% consider it their native language, leading to disparities in online information access. Unfortunately, automated dubbing of video (replacing the audio track of a video with a translated alternative) remains a complex and challenging task, because dubbing pipelines require precise timing, facial movement synchronization, and prosody matching. While end-to-end dubbing offers a solution, data scarcity continues to impede the progress of both end-to-end and pipeline-based methods. In this work, we introduce Anim-400K, a comprehensive dataset of over 425K aligned animated video segments in Japanese and English supporting various video-related tasks, including automated dubbing, simultaneous translation, guided video summarization, and genre/theme/style classification. Our dataset is made publicly available for research purposes at https://github.com/davidmchan/Anim400K.

anim-400k, dataset, video, (17 more...)

arXiv.org Artificial Intelligence

2401.05314

Country: North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Speech (0.95)
Information Technology > Artificial Intelligence > Machine Learning (0.68)

Apple HomePod review: a Siri speaker with a bass problem

The Guardian, Mar-20-2023, 07:00:04 GMT

Apple's big, high-quality smart speaker is back for a surprise second generation. But five years since the first model was launched, a lot has changed in the world of voice-controlled home hi-fi. Can the HomePod still cut it? The new HomePod has the same design as the old version: a marshmallow-like shape with a light-up disc at the top, fabric-covered body and a small silicone foot. The detachable power cable slots in the back but otherwise there are no ports or recesses. As with other HomePods, this speaker is for Apple users only.

apple music, homepod, siri speaker, (9 more...)

The Guardian

Country:

Oceania > Australia (0.05)
North America > United States (0.05)
Europe > United Kingdom (0.05)

Industry:

Leisure & Entertainment (0.79)
Media > Radio (0.30)

Technology:

Information Technology > Communications > Mobile (0.78)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.56)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.56)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.35)
